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Programming the Navier-Stokes Computer: 

An abstract machine model and a Visual Editor. 

David Middleton, Tom Crockett and Sherry Tomboulian 
ICASE, NASA Langley Research Center 

The Navier-Stokes Computer (NSC) is intended to apply large numbers of floating 
point ALUs to computational problems that can be expressed using calculations on long 
vectors [Nosenchuck etal. 87, Tomboulian ctal 88]. Since programming language consid- 
erations were ignored in the hardware design, efficient operation will depend on machine 
level programming. The Visual Editor developed here is intended to provide a support- 
ive environment in which the programmer can more effectively write machine language 
programs. 

Any programming system presents the user with a model of computation. The abstract 
Navier-Stokes computer described in this document is an explicitly chosen model for the 
Visual Editor to present. We prefer this approach to having the computational model 
evolve implicitly while the Editor is constructed. We do not use the complete NSC as this 
model for several reasons, in particular because of its complexity and, as yet, lack of stable 
definition. 

The abstract model is a subset of one node in the actual machine. This allows us to 
ignore issues of synchronisation, communication and multiprogramming that arise in the 
actual machine and to avoid implementing features not provided directly by the hardware. 
Naturally, it is hoped that the abstract model presented would waste little of the com- 
putational power of the eventual NSC (at least, for many problems) and that the Editor 
would provide trapdoors to allow the programmer to use the ignored features (although 
possibly only with relative difficulty). 
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The basic philosophy of the Visual Editor is to provide support and verification to 
the programmer building the complex microcode structures since even the abstract model 
remains ill-adapted to compilation tools. That is, the programmer makes all programming 
decisions, in particular, those regarding the allocation of resources; the Editor merely 
indicates errors without suggesting alternatives. A method for programming is developed 
(from which the Visual Editor’s operations are derived) which we hope will simplify the 
programmer’s task. 

This paper deals with three separate things which must be kept distinct: the actual 
node with its abilities, the abstract, subset node with its abilities, and the programming 
method with the attendant verification and abstractions to be provided by the Editor. 
The first section describes the abstract hardware model; the second section describes the 
programming process and the way that the Visual Editor would support it. The paper 
assumes fairly detailed knowledge of the “full node” of the NSC. Since its design is still 
being completed, the level of detail being attempted here will naturally lead to some 
inaccuracies. 

Abstract Navier-Stokes Machine node: a subset of reality. 

An abstract node has three parts: computation units which are connected to form 
pipes, pipelines consisting of pipes and a memory system with DMA controllers which 
assembles and feeds vectors into them and store vectors of results, and a central controller 
which (statically) schedules pipelines. 

Computation Units. 

These comprise 2 shift/delay units and 32 ALU’s. The shift/delay units allow a vector 
arriving from memory to be duplicated with different offsets. Each one contains four 
serially connected FIFO queues whose outputs are available to some ALU’s through the 
local switches. 
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Each ALU has a small internal register set which can be used in several ways, such as 
supplying the constants of an expression being evaluated. Each ALU is symmetrical, that 
is, if it can do A©£, it can do Bo A. In the a< tual node, this may involve switching inputs 
or manipulating the ALU’s output. Each input has a delay unit, called a vector latch, for 
aligning the ALU’s vector operands. The delay period can vary from zero to the size of 
the register set. The ALU’s are not homogeneous with respect to the functions they can 
perform; there are at least three different kinds of ALU depending on whether integer and 
logic or minimum and maximum operations are available. 

The possible connections between ALU’s are restricted in order to increase the num- 
ber available. In particular, the 32 ALU’s are hardwired into 16 Arithmetic/Logic Unit 
Structures (ALS’s) as 4 singlets, 8 doublets and 4 triplets. Programs can only control the 
output destination of the 16 ALU’s at the ou puts of the ALS’s. 

Implementation aspects. 

Actual ALU’s can be internally configured in various ways; these correspond to specific 
uses, such as evaluating recurrences, which can be displayed directly in the high level 
representation of the abstract ALU’s in the Editor. 

For connecting various units together, an actual node contains caches and switches 
(labeled MxF, DxF, DxS, FxF, SxF, FxD), both of which are absent from the abstract 
model. In the transition from the model to an actual node, the caches can be allocated 
in a straight-forward fashion, since the ability to store information between pipelines (by 
using the caches as vector registers) is ignored. The implementation of the switches leads 
to some complicated restrictions that will be explained in the programming section. 

Memory System. 

Memory in the abstract node consists of 16 planes (each with 128 mega-words) with 
hard boundaries between them; specifically, the user must allocate storage for variables so 
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that the inputs and outputs of each pipeline reside in individual disjoint memory planes. 
This exists because in the actual node, significant difficulties or penalties occur if, in a 
single pipeline, one variable spans multiple planes or one plane holds multiple variables. 

In the actual node, 16 Pipeline DMA units (PDMA’s) must be programmed to gen- 
erate and feed vector streams into pipes or to store vector streams back in memory. The 
corresponding input and output blocks in the Editor are given vector specifications which 
consist of an initial address , a count , a stride , a repetition factor and a pipeline delay. The 
first three are obvious; the repetition factor allows a single input value to be repeated or 
several output values to be overlain (thus keeping only the last one); the pipeline delay 
indicates the number of extra values which must be sent at the end or stored in front of 
the actual vector due to the filling and flushing a pipeline requires. The Editor ought to 
be able to deduce vector specifications from information provided during the programming 
process. 

Implementation aspects. 

The facilities provided by the caches and their controllers and the pipeline and mem- 
ory DMA units are used only to implement the vector specifications described above. The 
switches in the actual memory interface (MxD, DxM, MxM, FxM) are absent from the 
abstract node, and in at least two cases, some loophole specification method is necessary. 
The MxM switch would likely be used in cases where a variable needs to reside in differ- 
ent memory planes for different pipelines. Also, communication between different nodes 
(through “hyperspace”) uses these ignored facilities. 

Assuming cache ordering is irrelevant, in each pipeline we arbitrarily number the input 
caches 1 to n and the output caches n+1 to n+k (where n + k < 16). We assume caches are 
initially empty and are flushed after each pipeline finishes, that is, they do not hold results 
to be used from previous computations. This appears to be optional in recent versions of 
the actual machine (each cache has a ‘sticky’ bit, called “read/write”). 
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For the actual node, several independent pipelines (built from disjoint resources) can 
be separately or jointly initiated to operate in overlapping time periods. The definition 
of a pipeline in the abstract node is now extended to include several disjoint pipelines as 
defined earlier. The extension merely allows a pipeline to have disconnected parts. The 
difference from the actual NSC is that these component pipelines may not be dynamically 
scheduled: a set of component pipelines that may operate together are statically scheduled 
by the programmer always to run together. The components would operate in lockstep 
except that individual cache misses might stall one component while others proceed. Each 
component pipeline is internally synchronous; a delay at any of its PDMA’s stalls all the 
PDMA’s in that component. Although cache misses should in fact be predictable (the 
machine is not multiprogrammed), we view cache misses as indeterminate for simplicity. 

For each component pipeline, one PDMA is designated to send an End-Of-Pipeline 
(EOP) interrupt to the central controller when that pipeline finishes. The implementation 
of central control would be responsible for awaiting several PDMA EOP interrupts, one 
from the distinguished PDMA in each component, before initiating the next extended 
pipeline. 

Flow of control 

In an actual node, the central controller initiates pipelines by issuing the appropriate 
long instruction words to the various units described above. (Individual fields can be 
disabled so as not to interfere with other pipelines already in progress). It contains a micro- 
sequencer which selects the appropriate pipelines by executing microcode in a conventional 
way. We ignore the ability in the actual node of the Pipeline Status Table (and several 
other hardwired processes) to initiate pipelines. If this facility cannot be disabled, then 
the pipelines issued in those cases should have all their fields disabled. 
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In the abstract node, a pipeline will generate as many interrupts before it is finished 
as there are disjoint component pipelines. Having initiated a pipeline, the main microse- 
quencer waits for this number of interrupts before proceeding to its next instruction (which 
may be to initiate another pipeline or to execute its own code implementing more complex 
control flow). 

For the Visual Editor, a simple standard block-oriented language could be built on top 
of this machine model. The basic statements, pipeline initiations, would be aggregated 
using looping and conditional statements. Some loophole might be necessary to allow 
specifying blocks of micro-code, particularly, for example, in implementing the logical 
tests associated with control flow statements. 

Conditional Vector Expressions 

The actual node provides a mechanism for performing some conditional computations 
at full speed inside vectorised loops. This includes merging two vectors according to a 
logical mask, a feature which is available in some commercial vector machines. Exam- 
ples (in the language C) that can be implemented as Conditional Vector Expressions 
are “for(i=a ; i<b; i+=c) { if (test(i)) D[i]=f(i); else E[i]=g(i); } B and 
“for(i=a;i<b;i+=c) D[i] = ((test(i)) ? f(i) : g(i)) 

These expressions can be implemented as single pipelines with support from specialised 
hardware in the central controller, called the condition code resolution circuitry. The 
facility will be described by showing how the second example would be executed. 

First, the functions test, f and g are implemented with three component pipes, using 
the appropriate input vector variables. Although not necessarily evident from any diagram, 
these pipes need to be linked so that if one PDMA unit, say in /, suffers a cache miss, 
then the others, including those in the pipelines for g and test, will also stall. Second, the 
memory specifications for feeding these pipes are organised so that the values f(i) and g(i) 
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are available exactly one clock cycle after the function test(i) has generated an appropriate 
flag for the condition code resolution circuitry. Finally, two versions of this aggregate pipe 
are created, differing only in that the output in one comes from the / pipe while that of 
the other comes from the g pipe. The condition code resolution circuit issues one of these 
two pipeline specifications every clock cycle, according to the value of test . As a result, 
either the output of the / pipe or the g pipe is sent to the variable D in memory. The 
actual node must not allow this continual issuing of pipeline instructions to interfere with, 
for example, the counters operating in the PDMA units. 

More details cannot be given about the abstract node’s abilities as regards Conditional 
Vector Expressions until the actual node is better defined. Obviously, large numbers 
of function units, easily exceeding the number available, may be used (inefficiently) to 
maintain full vector speed in this fashion. In such cases, precomputing and storing any of 
the three vectors is an obvious alternative for the programmer to use. 

Neglected abilities of the actual NSC node 

This section is an incomplete in-line appendix listing facilities of the actual node that 
have not been exploited in the subset abstract node. 

Caches contain ‘sticky’ bits allowing data to be stored from one pipeline to 
another. 

The Pipeline Status Table can autonomously initiate pipelines following others’ 
terminations. 

The microcode is tree-structured rather than being flat, enabling specifications 
of memory configuration and pipe structure to be re-used. (This is due to the actual 
hierarchy not matching the abstract node s divisions which strongly interferes with 
any useful sharing of specifications). 
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Programming the abstract node with the Visual Editor. 

This section describes two separate concepts: the programming process for the NSC 
and as a derivative, the interface and support functions to be provided by the Visual Editor. 
A basic motivation for the Visual Editor is that the complexity of the NSC will prevent, at 
least in the near future, the development of compilers that can produce adequately efficient 
code. The goal of the Editor is to provide the user with a support tool for machine level 
programming that presents decisions in a suitable order to minimise undoing previous 
choices, and that validates those choices as they are made. 


The programming process is first described and then demonstrated using two examples 
(expressions to be evaluated within appropriate loops): 


+ U ‘ 




taken from the paper by Nosenchuck, Krist and Zang, and later, X, = a 

recurrence taken from the Livermore Loops [McMahon 86]. 


The programming process. 

The Visual Editor presents a main panel for displaying a pipeline computation flanked 
by side panels which display a menu, variable declarations and the control flow program 
(see Figure 1). 

The first stage o/ programming. 

First, the programmer writes the expression occurring within the loop in a comment 
area at the top of the main panel and second, derives (on paper) a data flow diagram 
which implements that expression. This step is difficult to automate due on the one hand 
to the wide range of variations that can arise, for example, through applying associativity 
and commutativity and selecting common subexpressions, and on the other hand to the 
timing constraints which the machine imposes. These aspects are particularly evident in 
the linear recurrence example. 


8 



OR'i.GETiAL PAHE IS 

OF POOR QUALITY 



Figure 1. Visual Editor layout 

Third, the user constructs a pipe corresponding to this data flow diagram, allocating 
singlets, doublets, triplets and shift/delay units from the menu. This step also includes 
many choices which affect path lengths. Path lengths turn out to be crucial in the linear 
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recurrence example. Restrictions on the operations available in different ALU’s also man, 
ifest themselves in this step. Appropriate techniques to support automation include using 
back-tracking languages such as Prolog and using graph grammars with the ALS’s (and 
the types of function units) as the terminal symbols. 

Along with the ALS’s and shift/delay units, the menu contains input and output blocks 
which generate (most of) the vector specifications for streams moving to and from memory 
planes. 

The input and output blocks in a given pipeline must specify distinct variables. When 
one variable occurs several times in a pipeline (likely with different subscripts), all uses 
must connect to the same input block. In the pipeline, the different subscripts are im- 
plemented by different initial delays; this aligns the vectors entering computational units. 
The different delays can be seen with the variable U in the first example. 

The first stage of analysis and validation. 

Fourth, the Editor verifies that the different path lengths will correctly align the dif- 
ferent vectors in the computation. Starting at each output block, it counts the delay from 
various points in the pipe, moving back towards the input blocks*. 

Mismatches in path lengths can arise when two backward accumulations from output 
blocks arrive at one point in the pipeline. This occurs when a value (either an input or 
a local subexpression) is used more than once. In some cases, this can be remedied by 
choosing a non-zero initial delay at one of the output blocks; this non-zero delay would 
be incorporated into the vector specification created for the PDMA associated with that 
block. This approach of using the PDMA’s ability to pad input vectors with leading and 

* It counts one clock per ALU, one clock per connection (corresponding to traversing 
a Type 1 or 2 switch), and whatever delays were specified by the programmer for vector 
latches and shift/delay registers. The connections to and from a shift/delay unit together 
cause only one clock of delay. There are further delays associated with cache and memory 
connections, which are not yet completely specified. 
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trailing dummy values which output PDMA's will discard, will not work if a common 
subexpression is used multiple times in calculating the same result. In such cases, an 
error is signalled and the programmer inserts delays explicitly by using shift/ delay units 
(a scarce resource) or by using vector latching in the ALU’s along the short paths. 

In the case of a duplicated input variable, i,he various path lengths leading to the input 
block must differ in the same amounts that the subscripts associated with each path differ. 

The final value(s) for the path length of the pipe gives the pipeline delay, which is used 
later in creating vector specifications for the PDMA’s. 

When a value is used more than once, some, complex restrictions in the switch networks 
will also occasionally cause difficulties. The F <F switch connecting the 16 ALS outputs to 
48 ALS inputs is built from three 16x16 permutation networks, thus dividing the 48 ALS 
inputs into 3 disjoint sets, labeled o, b and c. When a single value is sent to multiple ALS 
inputs, those inputs must be distinctly labeled. This limits the fanout to at most three 
and further constrains which sets of inputs can share a value. The labelling details are 
shown in Figure 2. 

We assume that simple numeric limits on fanout are sufficient at this stage and that 
the ability to swap inputs inside the ALS’s can overcome the detailed restrictions, which 
the Editor checks at a later stage. The output of each unit is provided with several counts 
each corresponding to the fanouts of the switch blocks to which that unit provides inputs. 
For example, ALS’s may be connected to other ALS’s with a fanout of 3, to shift/delay 
units with a fanout of 1 or to output PDMA s with a fanout of 1 (via the FxF, FxS and 
FxD switches respectively). In the Editor, connecting an output block to a given ALS 
decrements the ALS’s first count and connecting an ALS input to that ALS decrements 
its second count. Similarly, an input block has a fan-out of 3 to ALS’s and 2 to the 
shift/delay units, and a shift/delay unit has a fan-out of 2 to ALS’s. (Further, quite 
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Figure 2. Labelling of ALS inputs 


serious, restrictions apply to shift/delay outputs in that they can not be connected to all 
ALS’s. This is dealt with later.) 

A shift/delay unit displays four different delays each with respect to the input. The 
Visual Editor will check that the four serial queues which implement a shift/delay unit can 
support the given delays. The values are ordered and a limit of 8192 is placed on each of 
the three consecutive differences and the smallest value. 
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The second stage of programming. 

Fifth, the programmer allocates specific ALS’s to the singlets, doublets and triplets 
that form the pipe, satisfying the tightest restrictions first. 

The greatest restriction on this step is that the outputs from the shift/ delay units only 
connect to two triplets (ALSo and ALSi), two doublets (ALS4 and ALSg) and one singlet 
(ALS12). Furthermore, each of the eight outputs can only be used twice, once as an input 
to ALSo, ALS4 or ALSn, and once as an input to ALSi, ALS$ or ALS12. This restriction 
strongly affects the third step in which the programmer created the pipe from the data 
flow diagram. 

The other difficulty lies with paths that fork, especially from the shift/delay units, 
but also from ALS’s and input blocks. Sharing subgraphs, which causes such forking, 
has a large effect, even on the initial choice of data-flow graph, and so appears hard to 
automate. ALS’s must be allocated so that input labels that connect to any given fork 
point differ. It is not yet evident whether this will be difficult for the programmer. Again, 
the backtracking facilities of Prolog naturally lend themselves to this operation. 

Allocation of remaining ALS’s can be done arbitrarily. 

The third stage of programming. 

Having constructed a pipe to combine streams of values, the programmer next specifies 
how to generate the streams from multidimensional arrays. 

Sixth, the programmer provides a controlling loop whose index is used to generate 
vectors. The loop, modelled on FORTRAN implied loops, is specified at the top of the 
main panel near the textual comment describing the expression. (If the programmer creates 
an extended pipe which has independent components, then each one should have its own 
loop index, however all connected input/output blocks must use the same index). 
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Seventh, the programmer declares the variables in the side panel. A declaration 
looks like “X( 10, 20, 30) in mp3” which means the array X with subscripts in the ranges 
(0..9,0..19,0..29) is stored in memory plane 3. 

Eighth, for each input and output variable, the programmer creates a stream specifi- 
cation which allows values to be read from (or, in the case of output blocks, stored into) a 
sequence of the variable’s locations that differ by a constant stride. A stream specification 
will not allow a variable to be scanned in a transposed order from that of its storage; each 
“column” of the transpose could be fed to a pipeline computation, but explicit looping 
in the control flow program would be needed for the “outer loop”. (It is unclear whether 
PDMA’s in the actual node allow access to variables by “nested loops”). 

A stream specification comprises a variable name with a list of subscript expressions 
which matches the subscripts in the variable declaration. Each is a linear expression in the 
loop index, and where it has a consistent meaning, may be used to cover several consecutive 
indices. 

The abilities and restrictions of stream specifications are best demonstrated through 
some examples (assuming that X, as declared above, is stored with its leftmost subscript 
varying most rapidly). “X C i) ” with a controlling loop of “i=0.6000” scans X in storage 
order. “X(i)” with a loop of “i=0,6000,2” scans X in storage order, only taking alter- 
nate values from the “rows”, and is equivalent to “X(2*i>” with a loop of “1=0,3000”. 
“X(i) B with a loop of “i=0,6000,3” might well be forbidden, or at least flagged, since the 
increment, 3, does not divide the first index range, 10. “X(i)” with a loop of “i=0,200” 
gives the first plane of X. “X(i,l)” with a loop of “1=0,200” gives the second plane of 
X. “X(D” with a loop of “1=0,400” gives the first and second planes of X. “X(1,D” 
with a loop of “i=0,300” gives a second plane of X at a different orientation. No stream 
specification can be made for the third orientation since it cannot be scanned with a single 
stride. “X(i,i,i)” with a loop of “1=0,10” yields a diagonal. 
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Vector specifications are constructed from the control loop, the stream specifications, 
the individual offsets associated with uses of input variables, and pipeline lengths. 

The second stage of analysis and validation. 

Ninth, the Editor verifies that the variables fit in memory as declared and that for 
each pipeline, all the variables are in distinct memory planes. The actual NSC design 
may require that space be left preceding output variables to contain initial garbage values 
generated by the pipe; this is expensive in the case of large strides. 

The number of iterations given by the controlling loop, added to the pipeline’s path 
length indicates the count to be specified to the PDMA’s. Offsets associated with the input 
and output blocks indicate appropriate offsets to the variables’ base addresses, yielding the 
initial addresses for the PDMA’s. 

Subscript expressions within the input and output blocks yield the stride for address 
specifications. Subscript expressions can be checked for out-of-range violations with respect 
to their variable’s declaration. 

It appears that the replication field, while useful for the linear recurrence example 
below, cannot be determined through this model of the Visual Editor. Explicit specification 
by the programmer will be necessary. 

The fourth stage of programming. 

Tenth, the user writes a control flow program to initiate pipeline computations in a 
language similar to assembly language and BASIC. This step cannot be designed in more 
detail until the actual NSC node is better defined. 

Final notes. 

From the subscript expressions and in particular the offsets, it ought to be possible to 
derive delays such as those needed to address neighbours in a grid, as is used in the first 
example. 
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For each pipeline, a particular Pipeline DMA unit is selected to notify the main se- 
quencer when the pipeline completes. Any variable (input or output) seems sufficient for 
this task. 

Distributing variables across memory planes is extremely useful if a pipe can be du- 
plicated since this provides further parallelism. This is the responsibility of the user, and 
requires that variables be manually partitioned into separately named pieces. 

The following appears to be a reasonable method for the Editor to calculate path 
lengths. Where possible, path length discrepancies are resolved by altering initial lengths 
at the output blocks so that all begin with non-negative values and at least one is zero; 
otherwise require the user to insert delays as described above. Calculate the list of path 
lengths for each use of an input variable and the list of offsets derived from the subscripts 
associated with each variable’s use. Compute a third vector of non-negative values such 
that this vector plus (position-wise) the initial path length vector plus the offset vector 
gives a constant vector. This third vector represents the delays that must be inserted at 
the input PDMA’s, and the repeated value in the constant vector is the pipeline delay, 
that is the number of clock cycles which the PDMA’s must run beyond the length given 
by the controlling loop. 

Examples programmed. 

Example #1. 

The expression and a corresponding data-flow graph for the first example are shown 
in Figure 3. This is only one graph chosen from many possibilities to satisfy high-level 
considerations; it is created by applying arithmetic rules, such as, in this case, translating 
a - (6x c) into a + (-6x c). Let us assume that U has dimension 500 s and so U will (barely) 
fit in a single memory plane (128 Mwords). However, generating and U iij>k+1 , which 

are separated by 500,000 locations, is, first, not possible with the shift/delay units and, 
second, would add this amount to the pipeline length, which is otherwise just over 1000. 
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Hence, as described by Nosenchuck ttal ., U is partitioned across three memory planes, 
according to the value of the third subscript modulo three, so that £/, y fc _ x , f/,-,y,*+i and the 
five values *, lie in distinct memory planes. (Since .ft,,,,* is used to update U , it may be 
independently necessary to create two copies of U in two separate sets of memory planes, 
alternately generating each from the other). The five values lying in the same plane, C/ l j A , 
£/,•_ and f/ ti y +1 ,*, are generated using a shift/delay unit from a single 
stream coming from the one memory plane. 
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A pipe created by allocating resources to this data-flow graph is shown in Figure 4. 
Various points demonstrate the decisions made at this point in programming. Since only 
9 ALU’s from the total of 32 are used, this pipe might be duplicated once or twice to 
increase the processing speed proportionally (requiring in turn that the variables U, G 
and R be further partitioned across separate memory planes appropriately). The number 
of memory planes, the lack of a third shift/delay unit and the restrictions on ALS’s which 
can be connected to the shift/delay units prevent a third copy of the pipe from being built. 
Doubling the pipe probably requires dividing U into 6 memory planes to avoid performance 
penalties in accessing memory (although the actual PDMA’s may allow interleaved access 
by two separate pipelines to the same memory plane). 

The triplet and the upper doublet in Figure 4 must be chosen from ALSo, ALSi, ALS* 
and ALS5, since only these can connect to the SxF switch. Restrictions on the shift/delay 
unit to ALS connections can affect the shape of the pipe. For example, the stream of £/,-,•* 
values could be generated from U, + 1( y * with a one element vector latch in the same manner 
as £/,•_ | J>t . Assuming that were the case, those three values could not be combined in a 
single ALS because of the constraints imposed by the SxF network; a single shift/delay 
output can be shared to create U, + 1 J and only because of the reconfiguration 

provided inside ALU’s. 

Figure 5 shows the pipeline delays calculated at various points in the pipe, relative to 
the Ri^k output which has been set to zero. Thick horizontal cuts across paths represent 
the delays of the switching networks. In order that the subscripted variables specified at 
the inputs will all meet correctly aligned, all the paths through the pipe must have the 
same length. The five values derived from variable Ul, U.y*, f/.-jy*, U,- +1( y k , t/,-y_ 1Jt and 
Uij+i'ki have offset expressions of (0,0), (-1,0), (1,0), (0,-1) and (0,1), respectively, and 
require relative delays of 500, 501, 499, 1000 and 0 in order to be properly aligned. The 
lengths of their paths through ALS’s are 9, 11, 9, 9 and 9, respectively, requiring additional 
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delays of 500, 499, 499, 1000 and 0 which are provided by the shift/delay unit. The Editor 
can derive these delays from the offset expressions, and the dimensions from the variable 
declaration, in this case “U(500, 500,500) in mpl”. The variables G , U0 and U 2 with 
pipeline delays of 3, 10 and 10 also require additional alignment delays (1006, 999 and 999 
respectively) which can be provided by the input PDMA’s. The overall pipeline delay, the 
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time before valid data are being generated, is 1009, the delay through the pipe plus the 
longest shift in the shift/delay unit. PDMA’s will transfer 251,009 values, the first 1009 
of which the output PDMA’s should discard. 

Particular ALS’s are now allocated to the pipe. The triplet must be ALSo or ALSj; the 
upper doublet must be ALS 4 or ALSg. The remaining six ALS’s (assuming the displayed 
pipe is duplicated) can be assigned arbitrarily. 

Variable declarations would look something like “UO (500, 600, 167) in mpO”, 

“U1 (600,500, 167) in mpl”, “U2 (600, 600, 167) in mp2”, “U3 (500, 500, 167) in mp3”, 
“U4 (500, 500, 167) in mp4”, “U5 (500, 500, 167) in mp5”, “G (500, 500, 501) in mp6”, 
and “R(500,500,601) in mp7”. It would be pleasant if a controlling loop something 
like “i:0. .600x500x167” could perform one third of the computations throughout the 
volume U. However, since this pipe scans separated planes, G and R, as declared, cannot 
be scanned with a single stride. One possibility might be to store G as but that 

is obviously not a general solution to the problem. This leads to a control loop that 
looks something like “i : 0. . 500x600” for a computation length of 250,000 data. A further 
problem lies in splitting G and R across two memory planes to feed two pipelines. 

The control program would now be written to initiate these pipeline computations. 
First, there must be at least three pipelines, using U0, Ul and U2, respectively, as the 
source of the planes and the other two as the “vertical” neighbours. Further, it appears 
that the present NSC design would need 167 versions of each pipeline for the 167 starting 
addresses that correspond to different planes in the variables; no reasonable facility has 
been described in the NSC for providing run-time parameters to pipeline definitions (other 
than that provided through access to the microcode). 

Comments on Example #1. 

Examining this pipeline computation in detail shows several things. The complexities 
of the abstract machine and its real counterpart support the view that the Editor is merely 
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an aid to user who must control all choices. The two principal steps appear to be deter- 
mining the structure of the pipeline, and allocating variables to memory (which includes 
distributing variables among the planes and partitioning individual variables across mul- 
tiple planes). It seems that the user must have a reasonably accurate idea for appropriate 
choices in these steps before the editing process begins; the programming process described 
above seems only useful for verifying and filling in the details of the final program. 

This example also suggests that a major area for improvement with this machine is 
the memory system. The high computation rates seem to need a higher bandwidth than 
just sixteen paths to memory. Vector specification might well use a hierarchy of strides 
and counts to provide nested loops of access to variables, if the microcode controlling the 
DMA units could issue addresses rapidly enough. 

Example #2. 

In this example, we consider a recurrence taken from the Livermore Loops, X,- = 
Zi{Yi ~ Figure 6 shows a straight-forward implementation. Since there is a delay of 

three from the use of X,_x to the availability of X { at the same input, this pipeline relies on 
the replication facility in the vector specification hardware to discard the two intervening 
garbage values. This approach gives a vector rate of almost 7 MFLOPS, one third of the 
20 MFLOPS provided by two ALU’s, for a utilisation of 33%. There is a body of research 
on solving such first order recurrences in parallel; the following solution is used merely to 
study designing pipelines. 

The above expression can be expanded to X { = Z \Yi - Z i Z i _ l Y i _ 1 + Z i Z i _ 1 Z i _ 2 Y i _ 2 - 
Z> % Z±_ i ^,_2 X,_ 3 , a first pipeline for which is shown in Figure 7. Several choices were made: 
the tree that combines the four terms to be added is deliberately unbalanced to minimise 
the feedback delay involving X. For this pipeline, the Editor ought to be able to associate 
the variable X,^ 3 with the appropriate ALS input in order to verify the delay specified. 
Figure 8 shows the path lengths relative to the output, as they exist before delays are 
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Figure 6. Simple recurrence 

added. The branch labelled F has two paths to the output, but since both are of length 
8, no action is necessary. The branch labelled G, however, leads to two paths of different 
lengths, initially 4 and 6. In response to an error message, the user would insert a two 
clock delay to correct this discrepancy; one possibility, shown in the diagram, is to add a 
two cycle vector latch in the lower doublet ALS. Although the branch labelled H leads to 
two paths of different lengths, the output path of length 1 and the feedback path of length 
4, the Editor should not signal an error since the subscripts for the variables associated 
with the paths, X, and X,_ s differ by the same amount. 

Next, delays are added to align the input variables, which will incidentally resolve 
discrepant path lengths arising at the two remaining branches. From left to right, the 
variables and their pipeline lengths are: Z, with 8, Yi with 8, i with 8, Fi _ 2 with 6, 
Z { _ j with 10, Zi with 10 and with 8. Adding these lengths to the corresponding 
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Figure 7. Recurrence restructured 


subscript offsets yields 8 for 8 for Yf, 7 for Y^, 4 for Yj_ 2 , 9 for Z f _ x , 10 for Z, and 6 for 
The difference between this list and a constant list of all 10’s (the maximum value) 
is 2, 2, 3, 6, 1,0, 4. The remaining six vector latches shown in Figure 8 provide these delays. 
(Note that the two two-clock delays in the triplet might as well have been combined into 
a single two-clock delay in the following ALU (the one doing a subtraction)). 
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Figure 8. Restructured recurrence with pipeline delays 
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The next step is allocating ALS’s following the four branch points, the Y and Z inputs 
and the two internal pipe branches. H is not included since it only feeds a single ALS. 
Figure 9 shows one allocation to these five ALS’s. This step was very easy. Any time there 
is a branch, there are 6 ways to assign the labels a, b and c to the two or three resulting 
paths. It appeared that the four branches in this diagram could be set in any order and 
with any assignment without any backtracking being needed with subsequent branches. To 
test this hypothesis, the labels a, b and c were assigned to the branches without looking 
at the ALS labellings shown in Figure 2. A conflict occurred since both inputs to the 
doublet labelled 6 ended up being labelled a and none of the eight doublets satisfies this. 
Swapping the labels on branch F remedied this. Given the possibility of swapping ALU 
inputs, branch G was the only limiting factor in assigning the triplet to be ALS 3 . ALS 0 
is also possible but is considered more “valuable” since it is a possible output from the 
shift/delay units. Singlet ALSx 8 could be any other except ALSu; singlet ALSu could also 
be ALSib; doublet ALSg could be ALS 4 or ALS$, but these connect to the shift/delay units, 
and doublet ALS 7 could be any except ALSio or ALS n . A promising order of allocation, 
which would address the tightest limits first, would be to allocate ALS’s with inputs from 
branches in the order of singlets before doublets before triplets and then by the number of 
labelled inputs. For this example at least, the cross constraints on ALS allocation due to 
the input labelling shown in Figure 2 presented negligible difficulty to the programming 
process. 

This pipe will generate the vector X starting with X 3 . It is necessary to initialise 
this pipeline by inserting the values X 0 , X t and X 2 into ALS 7 and the FxF switch. These 
might be previously calculated by running the pipeline shown in Figure 6 and placing these 
values in the appropriate places and then shifting to this pipeline in a single time step. 
The actual NSC has this capability. 
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Figure 9. Restructured recurrence with pipeline delays 

This method for calculating the recurrence uses 9 ALS’s to generate values at the full 
20 MFLOP speed for an effective utilisation of 22%. 

Conclusions 

Due to its design and purpose, the Navier- Stokes computer presents a difficult challenge 
to being used at or near full effectiveness. A visual editor with verification facilities has 
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been proposed as a support tool for the programmer in the seemingly necessary task of 
programming the NSC at the hardware level. 

The study of two example applications suggests that the programming method pre- 
sented is appropriate for the NSC’s target: identical calculations performed on large arrays 
of data. 

The study might also indicate useful modifications to the NSC design. In particular, 
the channels to memory need to be much more flexible and probably more numerous. In 
contrast, the extensive restrictions imposed by the use of permutation networks as opposed 
to full cross-bar switches, appear not to affect the programming task noticeably, and so 
may well remain or increase in the quest for more ALU’s. 
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